outage_duration. They have tracked three outage durations: 0 for no outage, 1 for short outages that last anywhere from a few minutes up to 2 hours, and 2 for long outages that last from 2 hours up to, sometimes, a couple of days. You will now use the metrics the company has tracked to build a machine learning model that predicts outage_duration, so that the company can handle outages better, improve customer satisfaction, and thereby reduce customer churn.

import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
import xgboost as xgb
import operator
from scipy.stats import uniform as sp_rand
from scipy.stats import randint as sp_randint
import timeit
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
tic0=timeit.default_timer()
pd.options.mode.chained_assignment = None # default='warn'
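With the setup in place, it helps to keep the end goal in view: a multi-class classifier over the three outage_duration labels. A minimal sketch on synthetic stand-in data (the real features are engineered over the rest of this notebook; column counts and sizes here are made up) might look like:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-in: 500 events, 5 numeric features, 3 outage classes (0/1/2)
X = rng.rand(500, 5)
y = rng.randint(0, 3, size=500)

# Stratify so both splits keep the same class mix as the full data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_val, y_val)  # held-out accuracy
```

On random labels this accuracy is only a baseline sanity check, not a meaningful score.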
● train_data.csv: A unique event id for each observation of the outage_duration in a particular area_code.
● test_data.csv: Similar to the train dataset; we are given an id and an area_code and are expected to predict the outage_duration for each record. (This will be provided to you later, on 4th July.)
● broadband_data.csv: For each of the event ids in train_data.csv and test_data.csv, plus some additional ids, a record of the broadband_type. There are 10 different broadband types in the dataset.
● outage_data.csv: For each of the event ids in train_data.csv and test_data.csv, plus some additional ids, a record of the outage_type. There are 5 different outage_types recorded in the dataset.
● report_data.csv: For each event id, log_report_type and volume columns are recorded. log_report_type is the type of report generated by a technical team member after evaluating the outage; volume is the volume of data handled in the area at the time of the report, in custom company-specific units.
● server_data.csv: For each of the event ids in train_data.csv and test_data.csv, plus some additional ids, a record of the transit_server_type. Transit servers handle customers' requests and responses.
● sample_submission.csv: The CSV format required for submission to the evaluation backend. (The prediction file you upload to the tool to check your score must have the same format as this file.)
train_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
broadband_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
outage_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/outage_data.csv')
report_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/report_data.csv')
server_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
sample_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/sample_submission.csv')
train_data.shape, broadband_data.shape, outage_data.shape, report_data.shape, server_data.shape
print('The shape of broadband is: {}\n'.format(broadband_data.shape))
print('The shape of outage is: {}\n'.format(outage_data.shape))
print('The shape of report is: {}\n'.format(report_data.shape))
print('The shape of server is: {}\n'.format(server_data.shape))
print('The shape of train is: {}\n'.format(train_data.shape))
print('The shape of sample is: {}\n'.format(sample_data.shape))
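Since every auxiliary table keys on id, it is worth checking id coverage before merging. A self-contained sketch of the check on toy frames (not the real CSVs):

```python
import pandas as pd

train = pd.DataFrame({'id': [1, 2, 3]})
aux = pd.DataFrame({'id': [1, 2, 3, 4], 'feature': list('abcd')})

# ids present in train but absent from the auxiliary table (ideally none)
missing = train.loc[~train['id'].isin(aux['id']), 'id']
```

If `missing` is non-empty, a left merge will produce NaN feature rows for those ids.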
---------------------------------------------------------------------
train_data.csv: It has a unique event id for each observation of the outage_duration in a particular area_code

train_data.head()
train_data['area_code'].value_counts()
val=list(train_data['area_code'].value_counts())
for i in range(len(val)):
    print("Area Code", train_data['area_code'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
#count plot for Outage Duration
plt.figure(figsize = (14,6))
sns.countplot(train_data['outage_duration'])
plt.tight_layout()
plt.show()
0 for no outage; 1 for short outages that last anywhere from a few minutes up to 2 hours; 2 for long outages that last from 2 hours up to, sometimes, a couple of days.

import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data['area_code'],
y=train_data['id'],
mode="markers",text=train_data["area_code"])]
layout=go.Layout(title="ID vs Area_code",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
---------------------------------------------------------------------
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==0,'area_code'],
y=train_data.loc[train_data.outage_duration==0,'id'],
mode="markers",text=train_data["outage_duration"]),
go.Scatter(x=train_data.loc[train_data.outage_duration==1,'area_code'],
y=train_data.loc[train_data.outage_duration==1,'id'],
mode="markers",text=train_data["outage_duration"]),
go.Scatter(x=train_data.loc[train_data.outage_duration==2,'area_code'],
y=train_data.loc[train_data.outage_duration==2,'id'],
mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==0,'area_code'],
y=train_data.loc[train_data.outage_duration==0,'id'],
mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code with Outage Duration : 0",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
These areas are not facing much of a problem.

import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==1,'area_code'],
y=train_data.loc[train_data.outage_duration==1,'id'],
mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code Outage Duration : 1",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==2,'area_code'],
y=train_data.loc[train_data.outage_duration==2,'id'],
mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code Outage Duration : 2",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
train_data_outage = train_data.loc[train_data.outage_duration==2,'area_code'].value_counts()
train_data_outage
train_data['outage_duration'].value_counts()
#count plot for Outage Duration
plt.figure(figsize = (10,4))
sns.countplot(train_data['outage_duration'])
plt.tight_layout()
plt.show()
val=list(train_data['outage_duration'].value_counts())
for i in range(len(val)):
    print("Outage type", train_data['outage_duration'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
outage_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/outage_data.csv')
outage_data.head()
outage_data.shape
#count plot for Outage Duration
plt.figure(figsize = (14,6))
sns.countplot(outage_data['outage_type'])
plt.tight_layout()
plt.show()
outage_data['outage_type'].value_counts()
Here we can see the value counts of each outage type
val=list(outage_data['outage_type'].value_counts())
for i in range(len(val)):
    print(outage_data['outage_type'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
outage_data.outage_type=outage_data.outage_type.apply(lambda x: int(x.split("_")[2]))
outage_data.head()
outage_type_dummies = pd.get_dummies(outage_data,columns=['outage_type'])
outage_type_dummies.head()
outage_data.describe(include='all')
server_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
server_data.head()
server_data.shape
server_data.nunique().value_counts()
server_count = server_data['transit_server_type'].value_counts()
server_count
val=list(server_data['transit_server_type'].value_counts())
for i in range(len(val)):
    print(server_data['transit_server_type'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
# x and y given as array_like objects
import plotly.express as px
fig = px.scatter(server_data['id'], server_data['transit_server_type'])
fig.show()
server_count.head()
server_data.head()
server_data.transit_server_type=server_data.transit_server_type.apply(lambda x: int(x.split("_")[3]))
server_data.head(10)
#count plot for transit server type
plt.figure(figsize = (14,6))
sns.countplot(server_data['transit_server_type'])
plt.tight_layout()
plt.show()
server_data['transit_server_type'].value_counts().sum()
server_data.shape
server_type_dummies = pd.get_dummies(server_data,columns=['transit_server_type'])
server_type_dummies.head()
server_type_dummies.shape
server_data.describe(include='all')
server_data.shape
server_type_dummies.describe()
report_data.csv: For each event id, log_report_type and volume columns are recorded. log_report_type is the type of report generated by a technical team member after evaluating the outage; volume is the volume of data handled in the area at the time of the report, in custom company-specific units.

report_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/report_data.csv')
report_data.head()
report_data.shape
report_data.nunique()
log_report_ = report_data['log_report_type'].value_counts()
log_report_.head(10)
val=list(report_data['log_report_type'].value_counts())
for i in range(len(val)):
    print(report_data['log_report_type'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
#count plot for Log Report Type
plt.figure(figsize = (14,6))
sns.countplot(report_data['log_report_type'])
plt.tight_layout()
plt.show()
report_data.log_report_type=report_data.log_report_type.apply(lambda x: int(x.split("_")[3]))
report_data.head()
#bar plot of counts per log report type (countplot on a value_counts Series counts the counts themselves, which is not what we want)
plt.figure(figsize = (14,6))
report_data['log_report_type'].value_counts().plot(kind='bar')
plt.tight_layout()
plt.show()
log_report_.head(10)
report_data.reset_index(inplace=True)
report_data.rename(columns={'index':'count_of_log_report_seen'},inplace=True)
report_data.head()
report_data_value_counts = report_data.log_report_type.value_counts().to_dict()
report_data['num_ids_with_log_feature'] = report_data['log_report_type'].map(lambda x: report_data_value_counts[x])
report_data.head()
bins = [-10,0,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,
200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,
370,380,390,400]
report_data['binned_log_report'] = np.digitize(report_data['log_report_type'], bins, right=True)
bins_offset = list(map(lambda x:x+5, bins))
report_data['binned_offset_log_report'] = np.digitize(report_data['log_report_type'], bins_offset, right=True)
np.digitize returns the indices of the bins to which each value in the input array belongs. The offset bins give another, shifted binned feature for the log report type.
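A tiny worked example of np.digitize with right=True (toy values, not the report data):

```python
import numpy as np

bins = [0, 10, 20, 30]
x = np.array([5, 10, 25])
# right=True makes bins right-inclusive: (0, 10], (10, 20], (20, 30]
idx = np.digitize(x, bins, right=True)  # -> [1, 1, 3]
```

5 and 10 both fall in (0, 10] (index 1), while 25 falls in (20, 30] (index 3).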
report_data.head()
report_data['position_of_log_report'] = 1
report_data['position_of_log_report'] = report_data.groupby(['id'])['position_of_log_report'].cumsum()
report_data['log_report_type'] = report_data['log_report_type'].astype(int)
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=report_data['log_report_type'],
y=report_data['id'],
mode="markers",text=report_data["log_report_type"])]
layout=go.Layout(title="ID vs Log Report Type",xaxis={"title":"Log Report Type"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
report_data.head()
report_data.describe(include='all')
report_data.shape
broadband_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
broadband_data.head()
broadband_data.shape
broadband_data.nunique()
broadband_data['broadband_type'].value_counts()
From the data, broadband type 8 ('ADSL 1') is used the most, followed by broadband type 2, and so on.
val=list(broadband_data['broadband_type'].value_counts())
for i in range(len(val)):
print(broadband_data['broadband_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
#count plot for broadband type
plt.figure(figsize = (14,6))
sns.countplot(broadband_data['broadband_type'])
plt.tight_layout()
plt.show()
broadband_data.broadband_type=broadband_data.broadband_type.apply(lambda x: int(x.split("_")[2]))
broadband_data
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=broadband_data.id,
y=broadband_data.broadband_type,
mode="markers",text=broadband_data["broadband_type"])]
layout=go.Layout(title="ID vs Broadband Type",xaxis={"title":"ID"},yaxis=dict(title="Broadband Type"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
broadband_data.head()
broadband_type_dummies = pd.get_dummies(broadband_data,columns=['broadband_type'])
broadband_type_dummies
val=list(broadband_data['broadband_type'].value_counts())
for i in range(len(val)):
    print(broadband_data['broadband_type'].value_counts().index[i], round(val[i]/sum(val)*100), '%')
broadband_data['broadband_shifted_up'] = broadband_data['broadband_type'].shift(1)
broadband_data['broadband_shifted_up'] = broadband_data['broadband_shifted_up'].astype('float64','ignore')
broadband_data.head()
broadband_data.dtypes
broadband_data['broadband_type']=broadband_data['broadband_type'].astype('int64')
broadband_data.isnull().sum()
broadband_data['broadband_shifted_up'].fillna(8,inplace=True)
broadband_data.head()
broadband_data.describe()
broadband_data['is_next_broadband_type_repeat'] = 0
broadband_data['is_next_broadband_type_different'] = 0
broadband_data.head()
repeat_cond = broadband_data['broadband_shifted_up'] == broadband_data['broadband_type']
repeat_cond.value_counts()
broadband_data.loc[repeat_cond, 'is_next_broadband_type_repeat'] = 1
broadband_data.head()
broadband_data.loc[~repeat_cond, 'is_next_broadband_type_different'] = 1 # ~ negates the boolean mask
broadband_data.head()
broadband_data['switches_broadband_type'] = broadband_data['is_next_broadband_type_different'].cumsum()
broadband_data['switches_broadband_type'].value_counts()
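The shift-compare-cumsum logic above can be sanity-checked on a toy broadband_type sequence (made-up values, not the real data):

```python
import pandas as pd

s = pd.Series([8, 8, 2, 2, 5])          # toy broadband_type sequence
prev = s.shift(1).fillna(s.iloc[0])     # previous row's type; seed the first row
is_repeat = (s == prev).astype(int)     # 1 where the type repeats
is_different = 1 - is_repeat            # 1 where the type switches
switches = is_different.cumsum()        # running count of switches so far
```

Here the sequence switches twice (8→2 and 2→5), so `switches` ends at 2.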
broadband_data
broadband_data.nunique().value_counts()
broadband_data.describe()
broadband_data.dtypes
broadband_data.shape
print("Number of unique ID's in Train Data : ",train_data.id.nunique())
print("Number of unique ID's in Broadband Data : ",broadband_data.id.nunique())
print("Number of unique ID's in Server Data : ",server_data.id.nunique())
print("Number of unique ID's in Outage Data : ",outage_data.id.nunique())
print("Number of unique ID's in Report Data : ",report_data.id.nunique())
print('The shape of train is: {}\n'.format(train_data.shape))
print('The shape of broadband is: {}\n'.format(broadband_data.shape))
print('The shape of server is: {}\n'.format(server_data.shape))
print('The shape of outage is: {}\n'.format(outage_data.shape))
print('The shape of report is: {}\n'.format(report_data.shape))
train_broadband_combined = pd.merge(train_data,broadband_data,left_on = ['id'],
right_on = ['id'],how='left')
train_broadband_combined.shape
train_broadband_combined.head()
● Which areas are most prone to long outage durations?
● Which broadband types are suspected of causing long outage durations?
● Any other recommendations to improve the detection of outage durations.
train_broadband_outage = train_broadband_combined.loc[train_broadband_combined.outage_duration==2,'broadband_type'].value_counts()
train_broadband_outage
#count plot for broadband type
plt.figure(figsize = (14,6))
sns.countplot(train_broadband_combined['broadband_type'])
plt.tight_layout()
plt.show()
long_outage = train_data['outage_duration']==2
long_outage.head()
long_outage_ = train_data[long_outage]
long_outage_.head()
count=long_outage_.area_code.value_counts()
count.head(10).plot(kind="bar",figsize=(10,10))
count.head(10)
def str_to_num(string):
    return int(string.split("_")[1])
train_data["area_code"]=train_data["area_code"].apply(str_to_num)
train_data.head()
merge_1 = train_data.merge(outage_data,how="left",left_on=["id"],right_on=["id"])
merge_1.head()
merge_1.set_index("id",inplace=True)
merge_1.describe()
merge_1["num"]=merge_1.groupby(['area_code',"outage_duration"]).cumcount()+1
merge_1[(merge_1["area_code"]==1) & (merge_1["outage_duration"]==1)]
num represents the running count of outages of each type in an area, so we can see which kind of outage is more common in which area and focus on reducing it.

merge_1.num.value_counts()
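The cumcount trick used to build num can be seen on a toy frame (assumed values):

```python
import pandas as pd

df = pd.DataFrame({'area_code': ['a', 'a', 'a', 'b', 'b'],
                   'outage_duration': [1, 1, 2, 1, 1]})
# cumcount numbers repeats of each (area_code, outage_duration) pair from 0; +1 starts at 1
df['num'] = df.groupby(['area_code', 'outage_duration']).cumcount() + 1
```

The third row starts a new (area, duration) pair, so its num resets to 1.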
data=[go.Histogram(x=merge_1["num"])]
layout=go.Layout(title="number of cases",xaxis={"title":"number of cases"},yaxis=dict(title="count of no of cases"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=merge_1.loc[merge_1.outage_duration==0,'area_code'],
y=merge_1.loc[merge_1.outage_duration==0,'num'],
mode="markers",text=merge_1["outage_duration"]),
go.Scatter(x=merge_1.loc[merge_1.outage_duration==1,'area_code'],
y=merge_1.loc[merge_1.outage_duration==1,'num'],
mode="markers",text=merge_1["outage_duration"]),
go.Scatter(x=merge_1.loc[merge_1.outage_duration==2,'area_code'],
y=merge_1.loc[merge_1.outage_duration==2,'num'],
mode="markers",text=merge_1["outage_duration"])]
layout=go.Layout(title="num vs area_code",xaxis={"title":"area_code"},yaxis=dict(title="num"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
server_data
merge_2=server_data.merge(train_data, on='id')
merge_2.head()
merge_2.isnull().sum()
merge_2.nunique()
merge_2.shape
transit_server_type_unq=pd.DataFrame(merge_2.transit_server_type.value_counts())
transit_server_type_unq.head()
transit_server_type_unq["percen_trn"]=merge_2.transit_server_type.value_counts(normalize=True)*100
transit_server_type_unq
merge_2.head()
num to group by server type and area code

merge_2["num"]=merge_2.groupby(['transit_server_type','area_code']).cumcount()+1
merge_2.head()
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=merge_2.loc[:,'area_code'],
y=merge_2.loc[:,'num'],
mode="markers")]
layout=go.Layout(title="Number of Servers vs Area Code",xaxis={"title":"area_code"},yaxis=dict(title="number of servers"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
train_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
broadband_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
train_broadband_combined_ = pd.merge(train_data_,broadband_data_,left_on = ['id'],
right_on = ['id'],how='left')
long_outage_2 = train_broadband_combined_['outage_duration']==2
long_outage_2.head()
long_outage_2_=train_broadband_combined_[long_outage_2]
long_outage_2_.head()
broad_band=long_outage_2_.broadband_type.value_counts()
broad_band.head(10).plot(kind="bar",figsize=(10,10))
broad_band.head(10)
train_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
server_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
train_server_combined_1 = pd.merge(train_data_,server_data_,left_on = ['id'],
right_on = ['id'],how='left')
train_server_combined_1
long_outage_3 = train_server_combined_1['outage_duration']==2
long_outage_3.head()
long_outage_3_=train_server_combined_1[long_outage_3]
long_outage_3_.head()
server_outage=long_outage_3_.transit_server_type.value_counts()
server_outage.head(10)
server_outage.head(10).plot(kind="bar",figsize=(10,10))
train_broadband_outage.shape
train_data.head(), outage_data.head(), broadband_data.head(), server_data.head(), report_data.head()
train_data.shape, broadband_data.shape, outage_data.shape, report_data.shape, server_type_dummies.shape
train_server_merge_1 = pd.merge(train_data,server_type_dummies,left_on = ['id'],
right_on = ['id'],how='left')
train_server_merge_1.head(),train_server_merge_1.shape
train_outage_merge_02 = pd.merge(train_data,outage_data,left_on = ['id'],
right_on = ['id'],how='left')
train_outage_merge_02.head(), train_outage_merge_02.shape
train_report_merge_03 = pd.merge(train_data,report_data,left_on = ['id'],
right_on = ['id'],how='left')
train_report_merge_03.head(),train_report_merge_03.shape
train_broadband_merge_04 = pd.merge(train_data,broadband_data,left_on = ['id'],
right_on = ['id'],how='left')
train_broadband_merge_04.head(), train_broadband_merge_04.shape
train_server_outage = pd.merge(train_server_merge_1,train_outage_merge_02,left_on = ['id'],
right_on = ['id'],how='left')
train_server_outage.head(), train_server_outage.shape
train_broadband_report = pd.merge(train_broadband_merge_04,train_report_merge_03,left_on = ['id'],
right_on = ['id'],how='left')
train_broadband_report.head(), train_broadband_report.shape
### Final Merge
final_merge = pd.merge(train_server_outage,train_broadband_report,left_on = ['id'],
right_on = ['id'],how='left')
final_merge.head()
final_merge.shape
final_merge.nunique().value_counts()
type_=pd.crosstab(final_merge['outage_duration_x_x'],final_merge['outage_type'])
type_.plot(kind="bar", stacked=True, figsize=(8,6))
type_
dat=pd.crosstab(final_merge['broadband_type'],final_merge['outage_duration_x_x'])
dat.plot(kind="bar", stacked=True, figsize=(8,6))
dat
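The crosstabs above compare absolute counts; as an optional extra, normalizing by row (toy data here, not final_merge) gives the share of each outage class per broadband type, which is easier to compare across types of different sizes:

```python
import pandas as pd

df = pd.DataFrame({'broadband_type': [1, 1, 2, 2, 2],
                   'outage_duration': [0, 2, 0, 1, 2]})
# normalize='index' divides each row by its total, so each row sums to 1
ct = pd.crosstab(df['broadband_type'], df['outage_duration'], normalize='index')
```

The same `normalize='index'` argument would apply directly to the crosstabs of final_merge above.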